1. Abstract

COVID-19 data from World Health Organization is analyzed. ANOVA is used to identify any different in mean case fatality rate (CFR) by the six WHO regions. Normality assumption is violated on the distribution of CFR. The non-parametric test, Kruskal-Wallis Test, is used instead. Both ANOVA and Kruskal-Wallis Test result in the same conclusion that not all mean CFRs are the same among the regions. Tukey post-hoc test introduced to find which pairs are significant different in mean CFRs. With 95% family-wise significance level, Eastern Mediterranean region has significant difference in mean CFR than other region except South-East Asia. In addition, spectral clustering method is applied to cluster the countries based on the log cumulative deceased cases and log cumulative infected cases. The resulting clusters do not reflect the WHO regions. More variables may need in order to obtain compariable clusters as WHO regions. The new clusters can represent different kind of bounding among the countries. The relationship of the countries in each cluster or between cluster can be studied in the future.

2. Introduction

We will explore the World Health Organization (WHO) COVID-19 data in this project. WHO collects the daily new infected cases and new deceased cases from each country. Futhermore, WHO categorizes the countries in six regions. They are Africa, Americas, Eastern Mediterranean, Europe, South-East Asia and Western Pacific.

The primary question of interest is to detect any difference in the case fatality rate, (CFR), defined in Onder et al. (2020), among the six WHO regions. CFR is defined as the cumulative infected cases divided by the cumulative deceased cases. The result can let WHO, other country’s leaders and public know how the CFR difference among the six WHO regions. It provides a ground to study the factors result in the difference in CFR. The key variables to answer the primary question are the cumulative infected cases, the cumulative deceased cases and which WHO region each country belong to. I decide to use the cumulative infected cases and the cumulative deceased cases up to March 1, 2022. Using the cumulative infected cases and the cumulative deceased cases, we can calculate CFR. An experimental unit is a country. The observed value is the CFR. The factor is the WHO regions. Each region is considered to be a level. Therefore, we have 6 treatments. The null hypothesis is that there is no difference in CFR among the 6 treatments. The alternative hypothesis is that at least a pair of regions have a difference in CFR. We are using Analysis of Variance to test the hypotheses.

The secondary question of interest is to explore the potential clusters with the cumulative infected cases and the cumulative deceased cases. In particular, we want to specify six clusters and see can we still obtain back the same 6 WHO regions. If not, we want to summarize the difference. The cluster analysis may help us to find another structure among these countries rather than the six regions assigned by WHO. Researchers can further study the similarity or dissimilarity among the six clusters. It could possibly find the potential factors which may lead to the difference in the paired relationship between the cumulative infected cases and the cumulative deceased cases. The countries in the cluster with the lowest or highest cumulative infected cases and cumulative deceased cases could provide us useful information such as the effectively policy to control the spread of COVID. The key variables are the cumulative infected cases and the cumulative deceased cases. To answer this question, I also use the cumulative infected cases and the cumulative deceased cases up to March 1, 2022. We obtain the bivariate data from each country. Using this bivariate data, we can put the countries into six clusters with spectral clustering.

3. Background

In this generation, no one is going to forget the deadly virus which put every people in panic - COVID-19. The official name of COVID-19 is the severe acute respiratory syndrome coronavirus 2 or SARS-CoV-2 (He et al., 2020). The first infected case was reported in Wuhan, Hubei province, China in December 2019. Since it is first discovered in 2019. Therefore, people names the virus as COVID-19. Scientists believe the virus was transmitted to humans from bats through unknown intermediary animals since the virus is also found inside bats in China (Singhal et al., 2020). Once human is infected, it can transmit to next person by inhalation or contact with contaminated droplets. The incubation period is from 2 days to 14 days. The symptoms include fever, cough, sore throat, breathlessness, fatigue, malaise among others. In most people, the symtoms are mild. However, some people may have severe symtoms such as the elderly or comorbidities (Shuja et al., 2021). Eventually it lends to pneumonia, acute respiratory distress syndrome (ARDS) and multi organ dysfunction. People with weak immune system or some comorbidities could die if they are infected.

In this project, we are going to analyze how the cumulative infected cases and deceased cases are different among the countries. The WHO COVID-19 data can be found in weekly WHO COVID-19 update. The data set consisted of time-series information regarding the number of new infected cases, the cumulative infected cases, origin country and its country code, WHO region and the number of new deceased cases. The target population is each country. US Department of Health and Human Services (2019) mentioned that the number of infected cases or the deceased cases are collected from hospitals. They are required to the information to the Federal government. Facilities need to report at the individual hospital level. The data set is maintained by WHO and updated constantly. From 01-03-2020 to 03-01-2022, there are 186993 records from this raw data. The dataset includes 8 columns. They are listed below.

Column Value Description
Date_reported Date The date the numbers are collected
Country_code Factor Two characters’ country code
Country Factor Full name of the country
WHO_region Factor The World Health Organization Regions. There are 7 categories. They are African Region (AFRO), Region of the Americas (AMRO), South-East Asian Region (SEARO), European Region (EURO), Eastern Mediterranean Region (EMRO), Western Pacific Region (WPR) and Other.
New_cases Integer The number of new cases reported on the day
Cumulative_cases Integer The cumulative cases up to today
New_deaths Integer The number of new deaths reported on the day
Cumulative_deaths Integer The cumulative death countsup to today

The data set is extensively used in different literatures. For example, Kraemer et al. (2020) studied how COVID-19 affect the human mobility and travel restrictions in China. Also, Tian et al. (2020) investigated the transmission control measures of COVID-19 in China. In addition, Dey et al. (2021) used the data set to study the epidemiological outbreak of COVID-19 through visual exploratory data analysis. WHO makes the data set opens source. It can encourages more scientific research works on COVID-19. The number of publications related to the data set is also growing exponentially. In this project, we are going to use different statistical methods to analyze the data set.

4. Descriptive Analysis

The data set covers 237 countries. There are 186993 records up to March 1, 2022. There are 789 records with the country labeled as “Other”. We are going to remove them from the data set. Then, the data set only includes 186204 records. Furthermore, the countries Democratic People’s Republic of Korea, Micronesia, Nauru, Niue, Pitcairn Islands, Saint Helena, Tokelau, Turkmenistanand Tovalu have 0 infected cases in the covered time period. Most of them are small islands. It is not surprising they report 0 case so far. For the Democratic People’s Republic of Korea or North Korea, the country’s leader do not like to share any information from his country. So, it is not surprsing we do not get any reported cases from them. Eventually, we will further remove these 9 countries from the data. The final data set includes only 179103 records and covers 227 countries.

The top 9 countries with the most infected cases are given below in decending order.

Country Total.Infected.Cases
United States of America 78226534
India 42931045
Brazil 28768104
France 22032795
The United Kingdom 18976756
Russian Federation 16495369
Germany 14867218
Turkey 14088565
Italy 12783924

The time series plot of the number of new infected cases across time for the top 9 most severe countries is given below.

The time series plot of the cumulative infected cases across time for the top 9 most severe countries is given below.

It is not surpring that the United States has the most number of infected cases since we already heard it from the news. For India, the population size is large and the hygiene standard is low, it is not suprising that India has the second most number of infected cases.

The time series plot of the number of new deceased cases across time for the top 9 most severe countries is given below.

The time series plot of the cumulative deceased cases across time for the top 9 most severe countries is given below.

The top country with the most deceased cases is still dominated by the United States. However, the second most number of death cases is not India. Brazil has the second most number of deceased cases. India has the third most number of deceased cases. It also reflects that the burden on health systems in Brazil. Due to lack of timely and adequate care, more patients infected by COVID-19 are dead at home. It contributes the growth in population mortality.

Spaghetti plots to view the number of new cases across time for each of the 6 WHO regions are given below.

Spaghetti plots to view the number of cumulative of the new cases across time for each of the 6 WHO regions are given below.

To answer our primary and secondary question of interest, we focus on the date March 1, 2022. We will focus on the cumulative infected cases and deceased cases for each country. Besides, we will also calculate the case fatality rate (CFR). The summary statistics for the cumulative infected cases, the cumulative deceased cases and the case fatality rate on March 1, 2022 is given in the table below.

Cumulative_cases Cumulative_deaths CFR
mean 1921066.75770925 26249.0440528634 0.0139145312689676
sd 6777848.47526736 92503.5988755424 0.0160318498698666
min 4 0 0
Q1 22702 174.5 0.00541735868387036
median 160028 1923 0.0106174621173075
Q3 948406 11583 0.0188998519363165
max 78226534 942804 0.181377962789907
n 227 227 227
missing 0 0 0

Boxplot for each variable is given below.

Since all the three variables are skewed to right, it is hard to see the shape of the distribution. Therefore, I take the log transformation before we draw the histogram.

South-East Asia region includes much less countries compared to other regions. None of the distributions follow close to normal.

An interactive scatterplot plot to shows the number of deaths against the number of cases are given below.

## `summarise()` has grouped output by 'Date_reported'. You can override using
## the `.groups` argument.

5. Inferential Analysis

The primary question of interest is to detect any difference in the case fatality rate, (CFR), defined in Onder et al. (2020), among the six WHO regions. First, we demote \(\mu_{AF},\mu_{AM},\mu_{EM},\mu_{EU},\mu_{SA}\) and \(\mu_{WP}\) are the true mean of the case fatality rate in regions Afirca, Americas, Eastern Mediterranean, Europe, South-East Asia and Western Pacific on March 1, 2022 respectively. The hypotheses for the primary question of interest is

\(H_0: \mu_{AF}=\mu_{AM}=\mu_{EM}=\mu_{EU}=\mu_{SA}=\mu_{WP}\) against

\(H_1:\) not all means in \(H_0\) are equal.

We are going to use ANOVA to answer this question. The ANOVA model is given below.

\[ Y_{ij} = \mu+\alpha_i+\epsilon_{ij}\text{ for }i=1, 2, \cdots, 6, j=1, 2, \cdots, n_i\]

where \(\sum_{i=1}^6 \alpha_i =0\). \(n_i\) is the number of countries in region \(i^{th}\). \(\alpha_i\) is the main effect of region \(i^{th}\). \(\epsilon_{ij}\) is the error term for region \(i^{th}\) and country \(j^{th}\). We assume \(\epsilon_{ij}\sim_{iid} N(0,\sigma^2)\) for \(i=1, 2, \cdots, 6\) and \(j=1, 2, \cdots,n_i\). The ANOVA table is given below.

Df Sum.Sq Mean.Sq F.value p-value
WHO_region 5 0.0073692 0.0014738 6.422242 1.36e-05
Residuals 221 0.0507174 0.0002295 NA NA

The p-value \(1.357496\times 10^{-5}\). Since the p-value is less than 0.05, we reject null at 5% level of significance. We conclude that not all the mean CFR are the same among the WHO regions. Furthermore, we can run the Tukey post-hoc tests to find out which pairs are significant different in mean CFR. The set of confidence intervals on the differences between the means of the levels of WHO region with 95% family-wise probability of coverage are given in the table below.

diff lwr upr p adj
Americas-Africa -0.0018 -0.0103 0.0067 0.9905
Eastern Mediterranean-Africa 0.0122 0.0010 0.0234 0.0235
Europe-Africa -0.0061 -0.0145 0.0022 0.2894
South-East Asia-Africa -0.0022 -0.0173 0.0129 0.9983
Western Pacific-Africa -0.0097 -0.0199 0.0005 0.0731
Eastern Mediterranean-Americas 0.0140 0.0030 0.0250 0.0041
Europe-Americas -0.0043 -0.0124 0.0037 0.6393
South-East Asia-Americas -0.0004 -0.0154 0.0145 1.0000
Western Pacific-Americas -0.0079 -0.0179 0.0021 0.2071
Europe-Eastern Mediterranean -0.0183 -0.0291 -0.0075 0.0000
South-East Asia-Eastern Mediterranean -0.0144 -0.0310 0.0022 0.1301
Western Pacific-Eastern Mediterranean -0.0219 -0.0342 -0.0096 0.0000
South-East Asia-Europe 0.0039 -0.0110 0.0188 0.9747
Western Pacific-Europe -0.0036 -0.0134 0.0062 0.9008
Western Pacific-South-East Asia -0.0075 -0.0234 0.0085 0.7588

If the interval does not cover 0, it means the mean CFRs are significantly different between the two regions. From the table above, we can see Eastern Mediterranean and Africa, Eastern Mediterranean and Americas, Eastern Mediterranean and Europe, Eastern Mediterranean and Western Pacific have different mean CFRs. In other words, Eastern Mediterranean region has significant difference in mean CFR than other region except South-East Asia.

For the secondary question of interest, we would like to group the countries to another 6 clusters with the cumulative infected cases and the cumulative deceased cases on March 1, 2022. Currently, we can visualize how the WHO regions are separated by the cumulative infected cases and the cumulative deceased cases on March 1, 2022.

Since the distributions of both cumulative deceased cases and cumulative infected cases are highly skewed to right, so it is hard to visualize or cluster in this case. Therefore, I decide to take log transform on each variable before we cluster them. Since some values are 0, it returns NA if we take log directly. Therefore, I add 0.01 to each observed values before I take the log transformation.

It is not easy to identify the location of points for each region. Therefore, I separate the scatterplot by region. The plot is given below.

In this scatterplot, one point represents one country. Generally, the points or countries tend to stick together if they are coming from the same WHO region. Now, we are going to use spectral clustering to group the countries again. The scatterplot of log cumulative deceased cases against log cumulative infected cases by new clusters is given below.

Since spectral clustering is to identify communities of points in the scatterplot by edges connecting them, it is not surprising the points are grouped in a neat way. In other words, we will not be able to recover WHO regions with this method.

6. Sensitivity Analysis

For the primary question of interest, we need normality assumption on CFR for each region. We can going to check the normality assumption on CFR. The normal probability plot for the CFRs is given below.

Since most of the points do not fall on the linear line, so the normality assumption on the distribution of CFR is invalid. Also, outliers are detected on the boxplot in descriptive analysis section. Thus, ANOVA model should not be used. Instead, we should use non-parametric method. For example, we can use Kruskal-Wallis Rank Sum Test. The idea is similar to ANOVA. However, we are going to replace all the observed CFRs by its ranks instead. Since Kruskal-Wallis Rank Sum Test is a non-parametric test, it is distribution free. Therefore, it has less assumptions than ANOVA. It basically just assumes the observed values are randomly selected.

The p-value is \(4.065\times 10^{-6}\). Since the p-value is less than 0.05, we reject null at 5% level of significance. We conclude that not all the mean case fatality rates are the same among the six WHO regions. We reach the same conclusion as ANOVA test.

7. Conclusion and Discussion

The primary question of interest is to detect any difference in the case fatality rate (CFR) among the six WHO regions. ANOVA results show that not all the mean CFRs are the same among the six WHO regions. Tukey post-hoc test is used to test which pair of regions have significant difference in mean CFRs. With 95% family-wise level of significance, the result shows that Eastern Mediterranean region has significant difference in mean CFR than other region except South-East Asia. The result is questionable since the normality assumption for ANOVA does not fulfill. Kruskal-Wallis Test is used instead. Kruskal-Wallis Test is a non-parametric test. Although the power is much smaller, the method is distribution free. What is more, it does not affect by outliers. The Kruskal-Wallis test also conclude that the mean of CFRs are not all the same among six WHO regions.

Besides, we visualize how WHO regions distributed based on the log cumulative deceased cases and log cumulative infected cases. The countries from the same region tend to stick together on the scatterplot of the log cumulative deceased cases and log cumulative infected cases. Since those regions are formed geographically, it is reasonable the number of cumulative infected cases or deceased cases are similar. We try to produce new clusters based on the log cumulative deceased cases and log cumulative infected cases. The resulting clusters do not reflect the WHO regions. More variables may need in order to obtain compariable clusters as WHO regions. For example, lattitude and longitude could be useful to cluster them to groups similar to WHO groups. However, we think the new clusters also represent some kind of bounding among the countries. The relationship of the countries in each cluster or between cluster can be studied in the future.

8. References

  1. Onder, G., Rezza, G., & Brusaferro, S. (2020). Case-fatality rate and characteristics of patients dying in relation to COVID-19 in Italy. Jama, 323(18), 1775-1776.

  2. He, F., Deng, Y., & Li, W. (2020). Coronavirus disease 2019: What we know?. Journal of medical virology, 92(7), 719-725.

  3. Singhal, T. (2020). A review of coronavirus disease-2019 (COVID-19). The indian journal of pediatrics, 87(4), 281-286.

  4. Shuja, J., Alanazi, E., Alasmary, W., & Alashaikh, A. (2021). COVID-19 open source data sets: a comprehensive survey. Applied Intelligence, 51(3), 1296-1325.

  5. Dey, S. K., Rahman, M. M., Siddiqi, U. R., & Howlader, A. (2020). Analyzing the epidemiological outbreak of COVID‐19: A visual exploratory data analysis approach. Journal of medical virology, 92(6), 632-638.

  6. Kraemer, M. U., Yang, C. H., Gutierrez, B., Wu, C. H., Klein, B., Pigott, D. M., … & Scarpino, S. V. (2020). The effect of human mobility and control measures on the COVID-19 epidemic in China. Science, 368(6490), 493-497.

  7. Tian, H., Liu, Y., Li, Y., Wu, C. H., Chen, B., Kraemer, M. U., … & Dye, C. (2020). An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China. Science, 368(6491), 638-642.

  8. US Department of Health and Human Services. (2020). COVID-19 guidance for hospital reporting and FAQs for hospitals, hospital laboratory, and acute care facility data reporting. updated July, 29.

Appendix: All code for this report

# some required libraries for this project
library(tidyverse)
library(ggplot2)
library(lubridate)
library(tidyquant)
library(plotly)
library(gridExtra)
library(kernlab)
# read the covid data
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")

# only limit up to 03-01-2022, so the numbers on the report won't change 
# when we run it again
covid2 <- covid %>% 
  mutate(date = ymd(Date_reported)) %>%
  filter(date <= ymd("2022-03-01"))

# number of records
num_record = nrow(covid2) # 186993

# number of records with country or WHO_region "Other"
num_other = sum(covid2$Country == "Other") # 789

# filter out region = "Other" or country = "Other", and code WHO_region
covid3 <- covid2 %>%
  filter(WHO_region != "Other") %>%
  mutate(
    WHO_region = fct_recode(
      WHO_region,
      "Eastern Mediterranean" = "EMRO",
      "Europe" = "EURO",
      "Africa" = "AFRO",
      "Western Pacific" = "WPRO",
      "Americas" = "AMRO",
      "South-East Asia" = "SEARO"
  )
)

# new number of records
new_num_record = nrow(covid3) # 186204

# countries with 0 infected cases
cum_cases = with(covid3, tapply(New_cases, Country, sum)) 
countries_with_0_cases = names(cum_cases)[cum_cases == 0]

# finalize covid data set, filter out countries without any infected cases
covid4 = covid3 %>% filter(!covid3$Country %in% countries_with_0_cases)

# covered unique countries
covid4_num_covered_countries = length(unique(covid4$Country)) # 227 countries


total_cases_by_country <- tapply(covid4$New_cases, covid4$Country, sum)
total_cases_by_country_sorted <- sort(total_cases_by_country, decreasing = TRUE)
total_cases_by_country_top10 <- head(total_cases_by_country_sorted, 9)
total_cases_by_country_top10_table <- data.frame(Country = names(total_cases_by_country_top10), 
                                                 `Total Infected Cases` = as.numeric(total_cases_by_country_top10))
knitr::kable(total_cases_by_country_top10_table)
selected_countries <- names(total_cases_by_country_top10)
covid_filter_top10 <- covid4 %>% filter(Country %in% selected_countries)
covid_filter_top10 %>% 
  mutate(date = ymd(Date_reported)) %>% 
  ggplot(aes(x = date, y = New_cases, color = Country, group = Country)) + 
  geom_line() +
  ggtitle("The number of new infected cases across time for the top 9 most severe countries") + 
  labs(y = "Number of New Infected Cases", x = "Date") + 
  facet_wrap(~Country)
covid_filter_top10 %>% 
  mutate(date = ymd(Date_reported)) %>% 
  ggplot(aes(x = date, y = Cumulative_cases, color = Country, group = Country)) + 
  geom_line() + 
  ggtitle("The cumulative infected cases across time for the top 9 most severe countries") + 
  labs(y = "Number of Infected Cases in Total", x = "Date")
selected_countries <- names(total_cases_by_country_top10)
covid_filter_top10 <- covid4 %>% filter(Country %in% selected_countries)
covid_filter_top10 %>% 
  mutate(date = ymd(Date_reported)) %>% 
  ggplot(aes(x = date, y = New_deaths, color = Country, group = Country)) + 
  geom_line() +
  ggtitle("The number of new deceased cases across time for the top 9 most severe countries") + 
  labs(y = "Number of New Deceased Cases", x = "Date") + 
  facet_wrap(~Country)
covid_filter_top10 %>% 
  mutate(date = ymd(Date_reported)) %>% 
  ggplot(aes(x = date, y = Cumulative_deaths, color = Country, group = Country)) + 
  geom_line() + 
  ggtitle("The cumulative deceased cases across time for the top 9 most severe countries") + 
  labs(y = "Number of Deceased Cases in Total", x = "Date")
WHO_region_group <- unique(covid4$WHO_region)
n_region <- length(WHO_region_group) # number of WHO regions
fig.spaghetti.list <- vector("list", length = n_region)

for (i in 1:n_region){
fig.spaghetti.list[[i]] <- covid4 %>% 
  filter(Date_reported>= "2020-01-03", Date_reported<= "2022-03-01", WHO_region==WHO_region_group[i]) %>% 
  mutate(Date=as.Date(Date_reported)) %>%
  ggplot(aes(x=Date,y=New_cases,by=Country)) +
  geom_line(aes(color=Country)) +
  theme(legend.position ='none') + 
  labs(title = paste("New cases for region:", WHO_region_group[i]), 
       y = "New Infected Cases")
}

grid.arrange(fig.spaghetti.list[[1]], 
             fig.spaghetti.list[[2]], 
             fig.spaghetti.list[[3]], 
             fig.spaghetti.list[[4]], 
             fig.spaghetti.list[[5]], 
             fig.spaghetti.list[[6]], 
             nrow=3, ncol=2)

fig.spaghetti.list <- vector("list", length = n_region)

for (i in 1:n_region){
fig.spaghetti.list[[i]] <- covid4 %>% 
  filter(Date_reported>= "2020-01-03", Date_reported<= "2022-03-01", WHO_region==WHO_region_group[i]) %>% 
  mutate(Date=as.Date(Date_reported)) %>%
  ggplot(aes(x=Date,y=Cumulative_cases,by=Country)) +
  geom_line(aes(color=Country)) +
  theme(legend.position ='none') + 
  labs(title = paste("Cumulative infected cases for region:", WHO_region_group[i]), 
       y = "Cumulative Infected Cases")
}

grid.arrange(fig.spaghetti.list[[1]], 
             fig.spaghetti.list[[2]], 
             fig.spaghetti.list[[3]], 
             fig.spaghetti.list[[4]], 
             fig.spaghetti.list[[5]], 
             fig.spaghetti.list[[6]], 
             nrow=3, ncol=2)

# get date range from current covid data
date_range <- range(covid4$Date_reported)

# get one row for one country for the record up to 03-01-2022
covid_mini = covid4 %>% filter(Date_reported == "2022-03-01")

# calculate case fatality rate (CFR)
covid_mini$CFR = covid_mini$Cumulative_deaths/covid_mini$Cumulative_cases

# summary statistics for cumulative infected cases, cumulative deceased cases, CFR
covid_mini_reduce = covid_mini[, c("WHO_region", "Cumulative_cases", "Cumulative_deaths", "CFR")]

# function to summarize the statistics
calculate_summary = function(x){
  out <- data.frame(mean = mean(x, na.rm = TRUE),
                    sd = sd(x, na.rm = TRUE),
                    min = min(x, na.rm = TRUE),
                    Q1 = quantile(x, 0.25, na.rm = TRUE),
                    median = median(x, na.rm = TRUE), 
                    Q3 = quantile(x, 0.75, na.rm = TRUE), 
                    max = max(x, na.rm = TRUE),
                    n = length(x), 
                    missing = sum(is.na(x)))
  return(out)
}

# calculate summary statistics 
knitr::kable(sapply(covid_mini_reduce[,-1], calculate_summary))

# change the data to long format
covid_mini_long <-
  gather(covid_mini_reduce,
         variable,
         value,
         Cumulative_cases:CFR,
         factor_key = TRUE)

# log format
covid_mini_log = covid_mini_long
covid_mini_log$value = log(covid_mini_long$value)

# boxplot

ggplot(covid_mini_log) + 
  aes(x = variable, y = value, fill = WHO_region) + 
  geom_boxplot() + 
  labs(title = "Boxplot of each variable by WHO region")


# histogram of each variable by WHO Region
ggplot(covid_mini_log) + aes(x = value, fill = WHO_region ) + 
  geom_histogram() + facet_grid(WHO_region ~ variable) + 
  labs(title = "Histogram of the log of each variable by WHO region")


covid4 %>% 
  filter(Date_reported>= "2020-01-03", Date_reported <= "2022-03-01") %>% 
  group_by(Date_reported,WHO_region) %>%
  summarize(deaths = sum(New_deaths),
            cases = sum(New_cases)) %>% 
  mutate(Days_2021 = Date_reported- as.Date("2021-01-01")) %>%
  plot_ly(
    x= ~cases,
    y= ~deaths,
    frame = ~Days_2021,
    text=~WHO_region,
    hoverinfo="WHO_region",
    color=~WHO_region,
    type = 'scatter',
    mode = 'markers',
    showlegend = T
  )

# get CFR data in long format
covid_CFR_long = covid_mini_long %>% filter(variable  == "CFR")

# run anova
aov1 = aov(value ~ WHO_region, data = covid_CFR_long)

# ANOVA table
aov1_table = data.frame(summary(aov1)[[1]])
names(aov1_table)[5] = "p-value"
knitr::kable(aov1_table)
hsd = TukeyHSD(aov1)
knitr::kable(round(hsd$WHO_region, 4))

# get cumulative data in long format
covid_cum_wide <- covid_mini_reduce[,c("WHO_region", 
                                       "Cumulative_cases", 
                                       "Cumulative_deaths")]

# visualize the current cluster methods with WHO groups
ggplot(covid_cum_wide) + 
  aes(x = Cumulative_cases, y = Cumulative_deaths, col = WHO_region) +
  geom_point() + 
  labs(title = "Scatterplot of Cumulative deceased cases vs Cumulative infected cases",
       y = "Cumulative deceased cases", 
       x = "Cumulative infected cases")

# log transformation
covid_cum_wide$log_Cumulative_cases <- log(0.01+covid_cum_wide$Cumulative_cases)
covid_cum_wide$log_Cumulative_deaths <- log(0.01+covid_cum_wide$Cumulative_deaths)

# visualize the current cluster methods with WHO groups
ggplot(covid_cum_wide) + 
  aes(x = log_Cumulative_cases, y = log_Cumulative_deaths, col = WHO_region) +
  geom_point() + 
  labs(title = "Scatterplot of log cumulative deceased cases vs log cumulative infected cases",
       y = "log cumulative deceased cases", 
       x = "log cumulative infected cases")
ggplot(covid_cum_wide) + 
  aes(x = log_Cumulative_cases, y = log_Cumulative_deaths, col = WHO_region) +
  geom_point() + 
  labs(title = "Scatterplot of log cumulative deceased cases vs log cumulative infected cases by WHO Regions",
       y = "log cumulative deceased cases", 
       x = "log cumulative infected cases") + 
  facet_wrap(~WHO_region)

set.seed(2022)

# spectral clustering
spec.cluster <- covid_cum_wide %>% 
  select(log_Cumulative_cases, log_Cumulative_deaths) %>% 
  as.matrix() %>% specc(centers=6)

# store the cluster in main data
covid_cum_wide$cluster <- factor(spec.cluster@.Data)

# scatterplot to show the clusters
ggplot(covid_cum_wide) + 
  aes(x = log_Cumulative_cases, y = log_Cumulative_deaths, color = cluster) +
  geom_point() + 
  labs(title = "Scatterplot of log cumulative deceased cases vs log cumulative infected cases by new clusters",
       y = "log cumulative deceased cases", 
       x = "log cumulative infected cases")
with(covid_CFR_long, qqnorm(value))
with(covid_CFR_long, qqline(value))
# KW test
kw1 = kruskal.test(value ~ WHO_region, data = covid_CFR_long)